home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
Libris Britannia 4
/
science library(b).zip
/
science library(b)
/
MATH
/
MFLOAT10.ZIP
/
MFLOAT.DOC
< prev
next >
Wrap
Text File
|
1993-04-28
|
71KB
|
1,679 lines
MFLOAT Version 1.00
===================
A brief description of "MFLOAT":
* "MFLOAT" handles arithmetic using high precision numbers of up to 72 decimal
digits.
* "MFLOAT" is optimized for maximum speed for high precision numbers.
These subroutines are the fastest for the 8086 microprocessor known to us.
If you find faster subroutines for a 8086, please send me a mail.
* "MFLOAT" subroutines are much faster than a standard emulation of a
coprocessor for the same accuracy. "MFLOAT" does not use the coprocessor.
* "MFLOAT" is written in assembler for the 8086 microprocessor and
compatibles. The calculation is 10 to 20 times faster than that of
subroutines written in C or PASCAL. This is important for tedious
calculations like an optimization, where you have problems with the
accuracy. For this problems you can avoid the use of a super computer
which you feed with a C or PASCAL program for high precision numbers.
* All important functions of PASCAL and of a C library ("MATH.H") are
included. This includes the transcedental functions like sin(x), atan(x),
exp(x), log(x)....
* Programs written in C or C++, can use MFLOAT simply by replacing the
double data format with the "MFLOAT" data format. The operators and the
functions are overloaded for C++ and so only little changes of the source
code are necessary. (You need BORLAND C 3.1).
* The subroutines are included in compiled form (object files) in a program.
Therfore you can use them for PASCAL, C and C++. The subroutines are
tested for BORLAND PASCAL 7.0, TURBO PASCAL 6.0 and BORLAND C 3.1.
* This package is limited to about 72 decimal digits. You may license the
source code. Then you can extend the precision for the algebraic
calculations up to 10000 decimal digits (upper limit is due to the memory
limit of one segment of the 8086) and for transcendental function up to 500
digits.
* The internal data format is a binary floating point format. The length of
the mantissa can be choosen by the user from one word to 15 words.
(1 word = 16 bits). This correspondeds to about 5, 10, 14, 19, 24,
29, 34, 39, 43, 48, 53, 58, 63, 67 and 72 digits of a decimal number.
1 word represents 4.81648 digits of a decimal number in average.
The exponent of two has the length of 16 bits. The largest representable
number is about 7.07E+9863, the smallest is about 7.07E-9865.
***************************************************************************
The package includes following files:
READ.ME - general information
MFLOAT.DOC - describtion of the subroutines and mfloat numbers
MFLOATA.OBJ - subroutines written in assembler
MFLOATB.OBJ - subroutines written in C
MFLOAT.H - header file for C
MFLOAT.CXX - file for C++ : overloading functions and operators
PFLOAT.PAS - unit: declarations of the external subroutines
PI.PAS - example in PASCAL, calculation of pi
BESSEL.PAS - example in PASCAL, calculation of the bessel function
POLAR.PAS - example in PASCAL, conversion cartesian to polar coordinates
PI.C - example in C, calculation of pi
BESSEL.C - example in C, calculation of the bessel function
POLAR.C - example in C, conversion cartesian to polar coordinates
PI.CPP - example in C++, calculation of pi
BESSEL.CPP - example in C++, calculation of the bessel function
POLAR.CPP - example in C++, conversion cartesian to polar coordinates
PI.EXE - example program executable
The source code option includes further files:
MFLOATA.ASM - source code in assembler (basic subroutines)
MFLOATB.C - source code in C (transcendental functions and extensions)
MFLOATC.ASM - constants (include file of "MFLOATA.ASM")
CALCCON.PAS - calculation of constants in "MFLOATC.ASM" in PASCAL source
CALCCON.EXE - calculation of constants in "MFLOATC.ASM"
Registration:
see READ.ME
The authors of MFLOAT are interested in a feedback from the users of "MFLOAT".
If you have problems or you find some bugs (the subroutines are checked very
thorougly, but the probability, that a software of large dimension has no bug
is not large in reality), please describe them and send us a mail per
internet. You help us to remove this errors of the subroutines.
Many thanks for your help and much success with "MFLOAT".
Kaufmann Friedrich & Mueller Walter
Students at the Technical University of Graz
surface mail:
Kaufmann Friedrich
Schuetzenhofgasse 22
A-8010 GRAZ
AUSTRIA
EUROPE
email address:
fkauf@fstgds06.tu-graz.ac.at
****************************************************************************
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
****************************************************************************
Multiprecision calculations with "MFLOAT"
=========================================
"MFLOAT" is a package of subroutines written in assembler for a 8086 micro-
processor or compatibles. It is made for users having problems with the
accuracy of standard IEEE arithmetic.
The precision of IEEE-754 standard floating point numbers is good enough
for most technical and scientific applications. But sometimes a scientist
or a mathematician has problems in his calculations because the algorithm or
the mathematical problem is very ill-conditioned (for example finding the
roots of a polynomial). As the floating point numbers are only appoximations
of real or rational numbers, there is an additional error at every floating
point operation.
There is no method known to represent every real number exactly in a
binary format and therefore an approximation is necessary. A practical method
is to extend the mantissa of the floating point number, which results in
less error at every operation. But there are limitations because of higher
memory needs for every floating point number and increased execution time.
This is why a scientist having numerical problems of this kind is looking
for very fast subroutines for multiprecision floating point numbers.
This package represents such a package of subroutines. It uses a binary
data format because there are optimal instructions for the manipulation of
binary numbers in the instruction set of the 8086. A binary data format has
also the advantage, that the used memory is minimized. The only disadvantage
is the higher effort that is necessary for the input and output of the
numbers in ASCII-Format.
The main goal is a minimal computing time for very accurate numbers.
Therefore no time is wasted for rounding (every result is truncated) and
the handling of abnormal numbers as NANs or denormals. The only special
number is zero. For saving time, there is no stack overflow check.
In case of an overflow or some other calculation error an error flag is set,
which can be tested during or at the end of the calculation. Of course in
such a case the numerical result is wrong. In case of an underflow the result
is set to zero without any message.
A multiprecision number is consists of an array of words (1 word = 2 byte
= 16 bit). The first word represents the exponent to the base of two. It is
represented with an offset of 8000 Hex (= 32768) and ranges from 1 to
FFFF Hex (= 65535). The special value 0 is used to represent the floating
point number zero. In this case the rest of the array is arbitrary. The
second word of the array represents the first word of mantissa. The mantissa
is normalized for every floating point number and therefore the most
significant bit of the first word of mantissa is always set. As there is no
information in this bit, it is removed and at its place the sign bit is
stored. The sign bit is one if the floating point number is negative, and
zero if the floating point number is positive. The further words of
the multiprecision number are words of mantissa. The length of mantissa
ranging from one to a constant (15 in this release) can be choosen by the
user.
Example : 8002 490F Hex represents 3.141540528..
Every word of mantissa represents in average 4.8165 (= 16 * log10(2)) digits
of a decimal number. Therefore you need 15 words of mantissa for a accuracy
of 72 decimal digits and the whole number uses 16 words or 32 bytes.
If you have the source code, you can increase the accuracy. The upper limit
of the accuracy is given by the used stack of the subroutines, which is
about 3 times the memory of a "MFLOAT" number for the arithmetic operations
addition, subtraction, multiplication and division. The stack is limited to
one segment of the 8086 which is 65536 byte. A further limit for
transcendental functions is 125 mantissa words (602 digits of a
decimal number) for the series of the sine. If you extend this limit, you
will get an error, for an argument near pi/4.
"MFLOATA.OBJ" includes 19 "MFLOAT" constants. This constants are stored in
the code segment, which is limited to 65536 bytes.
There is a very large range for the exponent. The greatest number, that can
by represented is about +/-7.07E+9863, the smallest is about +/-7.07E-9865.
Now lets take a look inside the subroutines. The coprocessor 8087 or a
compatible is not used because the IEEE format does not match the data
format. There are subroutines for the algebraic operations addition,
subtraction, multiplication and division.
Addition and subtraction are done by the same subroutine. The time for an
addition (or subtraction) is proportional to the actual number of mantissa-
words.
The implementation of the multiplication uses the instruction "MUL".
which is much faster than shifted additions. The multiplication result
is two times as large as its operands but only the half of the accuracy is
needed. Therfore its a waste of time to calculate the second part of the
mantissa. If exactly half the number of mantissawords are calculated and the
rest is neglected, there is an accumulation of truncating errors, which can
be very large. Therfore an additional extra word is added to the result during
calculation, but for the final result this extra word is ignored. By this
way half of the calculation time is saved and only small additional errors
occure. The calculation time increases quadratic with the number of
mantissawords.
The division is implemented using an algorithm similar to hand calculation.
First an estimate of the first word of the result is made and the rest is
calculated. If the rest is negativ, the estimation is corrected until it is
ok. Using the rest the next word of mantissa is estimated and eventually
corrected. This goes on until the whole mantissa of the result is calculated.
For saving calculation time, the rest is truncated at every step by one word.
Only at the first step an extra word is added (no truncation). Because of this
method half of the calculation time is saved (same priciple as by the
multiplication). The calculation time increases quadratic with the number of
mantissawords.
The first transcendental function is the square root. There is a hand
calculation method known similar to a division. The implemented algorithm
uses this method. For saving calcultion time the same tricks are used as
in the implementation of multiplication and division. The calculation time is
smaller than those of a division. There are also iterative methods known to
calculate the square root. These methods are much slower than the implemented
method. (3 - 5 times slower).
Further trancendental functions are the trigonometric, hyperbolic functions,
the exponential function and their invers functions. This functions can
be derived from four series:
sin(x) = x - x^3/6 + x^5/120 - .....
sinh(x) = x + x^3/6 + x^5/120 + .....
arctan(x) = x - x^3/3 + x^5/5 - ..... |x| < 1
artanh(x) = x + x^3/3 + x^5/5 + ..... |x| < 1
These are the basic functions for further calculations. All other functions
can be derived from them. Here are some examples:
cos(x) = sin(x + pi/2)
tan(x) = sin(x) / cos(x)
cot(x) = cos(x) / sin(x)
cosh(x) = sqrt(1 + sqr(sinh(x))
tanh(x) = sinh(x) / cosh(x)
coth(x) = cosh(x) / sinh(x)
exp(x) = sinh(x) + cosh(x)
arcsin(x) = arctan(x / sqrt(1 - sqr(x)))
arccos(x) = pi/2 - arcsin(x)
arccot(x) = pi/2 - arctan(x)
arsinh(x) = artanh(x / sqrt(1 + sqr(x)) = ln(x + sqrt(1 + sqr(x)))
arcosh(x) = artanh(sqrt(sqr(x) - 1) / x) = ln(x + sqrt(sqr(x) - 1))
arcoth(x) = artanh(1/x)
ln(x) = 2 * artanh((x - 1) / (x + 1))
Some identities are used to save calculation time:
sin(x + 2 * pi) = sin(x)
exp(x + ln(2)) = 2 * exp(x)
....
and so on.
Most of the time is used for the calculation of the series. Therefore it is
important to write an optimal assembler subroutine for the whole series.
This series could also be programmed directly by the former subroutines
but using a specialized assembler subroutine the calculation time can be
reduced to one third at even higher accuracy. The calculation time for a
series increases with the third power of the number of mantissawords.
The subroutines are testet using the following compilers:
Turbo Pascal 6.0, Borland Pascal 7.0, Borland C/C++ 3.1
You may use the library in combination with other compilers but in this
case you may have to adabt the header files. Please note, that pascal
calling conventions (upercase only, argument order, no underlines, ...)
is implemented. All routines are 'far' with 'far' data pointers.
===========================================================================
***************************************************************************
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
***************************************************************************
===========================================================================
Using the subroutines for BORLAND C 3.1 :
=========================================
To make the functions available you have to include MFLOAT.H in your source
and to specify MFLOATA.OBJ and MFLOATB.OBJ in your project file.
#define mfloatwords 16 /* maximum precision */
typedef int mfloata[mfloatwords];/* mfloat array */
#define mfloat_ int far * /* pointer to array */
#define mfloat mfloata /* for use in 'C' programs */
(declared in "MFLOAT.H")
This type declaration for multiprecision floating point numbers sets
a maximal precision of 15 mantissawords or about 72 decimal digits. The
maximum precision is an assembly time constant. In this release it is fixed
to 15 mantissawords. It is possible to change the type declaration to reduce
the number of mantissawords. This leads to a saving in memory needs and to
higher execution speed at the cost of precision. It is also possible to adjust
the precision dynamically using the function 'setmantissawords'.
The programmer must take care, that the stack is large enough, because
there is no check of a stack overflow.
****************************************************************************
++++++++++++++++++++ mfloat basic functions ++++++++++++++++++++++++++
****************************************************************************
void far pascal setmantissawords(int number);
This procedure sets the length of the mantissa of mfloat numbers. The number
of word must be in the range of [1..15], otherwise is is bounded to these
values. If the number of mantissawords is increased and the numbers of
former calculations are used further, the additional words are not
initialized to zero, their values are not known. The error of the number
due to this further mantissa words is of course very small and is in the
range of the accuracy of the former calculation. If it is important, that
this further mantissa words are zero, the user has to initialize them (see
data format). If the number of mantissa words is reduced, every number can
be seen as truncated, although nothing is changed at this numbers. The number
of mantissa words refers only to calculations.
****************************************************************************
int far pascal getmantissawords(void);
The function returns the actual number of mantissa words.
****************************************************************************
void far pascal reseterror(void);
The error flag is reset.
****************************************************************************
char far pascal geterror(void);
The error flag is returned. If there ocures a runtime error during the
calculation, the error flag is set to true (=1).
****************************************************************************
mfloat_ far pascal equm(mfloat_ a, /* a <-- b */
const mfloat_ b);
The value of b is copied to variable a with the defined accuracy. The value
of b is not changed. The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal addm(mfloat_ a, /* a <-- a + b */
const mfloat_ b);
The value of b is added to the value a with the defined accuracy. The
result is variable a. The value of b is not changed. The return value is a
pointer to a.
****************************************************************************
mfloat_ far pascal subm(mfloat_ a, /* a <-- a - b */
const mfloat_ b);
The value of b is subtracted from the value a with the defined accuracy.
The result is variable a. The value of b is not changed. The return value is
a pointer to a.
****************************************************************************
mfloat_ far pascal multm(mfloat_ a, /* a <-- a * b */
const mfloat_ b);
The value of a is multiplied by value b with the defined accuracy. The result
is variable a. The value of b is not changed. The return value is a pointer
to a.
****************************************************************************
mfloat_ far pascal divm(mfloat_ a, /* a <-- a / b */
const mfloat_ b);
The value of a is divided by the value b with the defined accuracy.
The result is variable a. The value of b is not changed. The return value is
a pointer to a.
****************************************************************************
mfloat_ far pascal multi(mfloat_ a, /* a <-- a * b */
int b);
The value of a is multiplied by the integer value b with the defined
accuracy. The result is variable a. The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal divi(mfloat_ a, /* a <-- a / b */
int b);
The value of a is divided by the integer value b with the defined accuracy.
The result is variable a. The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal inversm(mfloat_ a); /* a <-- 1 / a */
The number one is divided by the number a with the defined accuracy.
The result is variable a. The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal negm(mfloat_ a); /* a <-- - a */
This procedure changes the sign of the number of variable a. The return value
is a pointer to a.
****************************************************************************
char far pascal eqzero(const mfloat_ a); /* eqzero <-- (a == 0) */
This function checks, if the multiprecision number a is zero. The value of a
is not changed. The return value is 0 or 1.
****************************************************************************
char far pascal gtzero(const mfloat_ a); /* gtzero <-- (a > 0) */
This function checks, if the multiprecision number a is greater than zero.
The value of a is not changed. The return value is 0 or 1.
****************************************************************************
char far pascal gezero(const mfloat_ a); /* gezero <-- (a >= 0) */
This function checks, if the multiprecision number a is greater than zero or
equal zero. The value of a is not changed. The return value is 0 or 1.
****************************************************************************
char far pascal gtm(const mfloat_ a, /* gtm <-- (a > b) */
const mfloat_ b);
This function checks, if the multiprecision number a is greater than the
multiprecision number b. The values of a and b are not changed. The return
value is 0 or 1.
****************************************************************************
char far pascal eqm(const mfloat_ a, /* eqm <-- (a == b) */
const mfloat_ b);
This function checks, if the multiprecision number a is equal to the
multiprecision number b. The values of a and b are not changed. The return
value is 0 or 1.
****************************************************************************
mfloat_ far pascal getzerom(mfloat_ a); /* a <-- 0 */
This procedure copies the value zero = 0.0000.. to the variable a. The return
value is a pointer to a.
****************************************************************************
mfloat_ far pascal getonem(mfloat_ a); /* a <-- 1 */
This procedure copies the value one = 1.0000.. to the variable a. The return
value is a pointer to a.
****************************************************************************
mfloat_ far pascal getpim(mfloat_ a); /* a <-- pi */
This procedure copies the value pi = 3.14159.. to the variable a. The return
value is a pointer to a.
****************************************************************************
mfloat_ far pascal getln2m(mfloat_ a); /* a <-- ln(2) */
This procedure copies the value ln(2) = 0.69314.. to the variable a.
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal getln10m(mfloat_ a); /* a <-- ln(10) */
This procedure copies the value ln(10) = 2.3025.. to the variable a.
The return value is a pointer to a.
****************************************************************************
int far pascal strtomf(mfloat_ a, /* a <-- b */
const char far * b);
This procedure converts the string b to the multiprecision number a. If the
conversion is correct, the integer zero is returned. Otherwise the number
of the character, where the first error occurs, is retured. For conversion
the accuracy of the calculation is set to maximum and afterwards is set back
to the original value. This function is not optimized for speed, but for
accuracy, because it is not used very often normally.
****************************************************************************
char far * far pascal mftoa(char far * str, /* str <-- a */
const mfloat_ a,
int len);
This procedure converts the multiprecision number a to a string str with
the number of ASCII - characters, which is defined in len. If the number
of characters is too less, the string is set to stars '*'. The format used
for this conversion is '.32767F' (see mftostr). For conversion the
accuracy of the calculation is set to maximum and afterwards is set back to
the original value. This function is not optimized for speed, but for
accuracy, because it is not used very often normally.
The return value is a pointer to the character array a.
****************************************************************************
char far * far pascal mftostr(char far * str, /* str <-- a */
const mfloat_ a,
int len,
const char far * format);
This procedure converts the multiprecision number a to a string str with
a maximum number of ASCII - characters, which is defined in len. If the
number of characters is too less, the string is set to stars '*'.
The characters in the format - string have following meaning:
'.' the point is shown ervery time, not significant zeros behind the point
are shown
'E' scientific format with an exponent of ten
'F' fixpoint format, if the number is too large, a scientific format
with an exponent of ten is used.
'G' fixpoint format, if the number is too large, or is less than 0.0001,
a scientific format is used.
'1..' an integer, which declares the number of digits behind the point
for fixpoint and scientific format.
The order of the characters (but not the digits for the precision) in the
format string is arbitrary. It is also allowed to use lower case characters
for the format, but in this case the letter for the exponent of ten is
also the lower case letter 'e'. For conversion the accuracy of the
calculation is set to maximum and afterwards is set back to the original
value. This function is not optimized for speed, but for accuracy, because
it is not used very often normally.
The return value is a pointer to the character array a.
****************************************************************************
double far pascal mftod(const mfloat_ b); /* mftod <-- b */
This function converts a multiprecision number to a double number.
The value of b is not changed.
****************************************************************************
long double far pascal mftold(const mfloat_ b); /* mftold <-- b */
This function converts a multiprecision number to a long double number.
The value of b is not changed.
****************************************************************************
mfloat_ far pascal dtomf(mfloat_ a, /* a <-- b */
double b);
This procedure converts a double precision number to a multiprecision number.
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal ldtomf(mfloat_ a, /* a <-- b */
long double b);
This procedure converts a long double number to a multiprecision number.
The return value is a pointer to a.
****************************************************************************
+++++++++++++++ standart functions (Borland C: MATH.H) +++++++++++++++
****************************************************************************
mfloat_ far pascal acosm(mfloat_ a); /* a <-- acos(a) */
This procedure calculates the invers cos(x) function of the argument a. The
result is the variable a.
-1 <= a <= 1 range of the argument, otherwise the error flag is set
0 <= a <= pi range of the result
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal asinm(mfloat_ a); /* a <-- asin(a) */
This procedure calculates the invers sin(x) function of the argument a. The
result is the variable a.
-1 <= a <= 1 range of the argument, otherwise the error flag is set
-pi/2 <= a <= pi/2 range of the result
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal atanm(mfloat_ a); /* a <-- atan(a) */
This procedure calculates the invers tan(x) function of the argument a. The
result is the variable a.
-pi/2 < a < pi/2 range of the result
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal atan2m(mfloat_ a, /* a <-- atan2(a,b) */
const mfloat_ b);
This function calulates the extended arctan function phi = arctan(a/b). If
a is zero or less than zero, the correct value phi is retured, which is
in the intervall (-pi, pi]:
a = r * sin(phi)
b = r * cos(phi)
r = sqrt(sqr(a) + sqr(b)) >= 0
If a = 0 and b = 0, the value phi is not unique. The error flag is set and
the number zero is returned. The return variable is the variable a. Variable
b is not changed.
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal ceilm(mfloat_ a); /* a <-- ceil(a) */
The least integer number, which is greater than or equal the mfloat number
a, is calculated and returned to the variable a.
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal cosm(mfloat_ a); /* a <-- cos(a) */
This procedure calculates the cos(x) function of the argument a. The result
is the variable a.
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal coshm(mfloat_ a); /* a <-- cosh(a) */
This procedure calculates the cosh(x) function of the argument a. The result
is the variable a.
-22712 < a < 22712 range of the argument without an overflow
a >= 1 range of the result
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal expm(mfloat_ a); /* a <-- exp(a) */
This procedure calculates exponential function exp(x) of the argument a. The
result is the variable a.
a < 22712 range of the argument without an overflow
a < -22712 there is an underflow result = 0
a >= 0 range of the result
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal fabsm(mfloat_ a); /* a <-- fabs(a) */
This procedure returns the absolute value of the number a. The result is
the variable a.
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal floorm(mfloat_ a); /* a <-- floor(a) */
The greatest integer number, which is less than or equal the mfloat number
a, is calculated and returned to the variable a.
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal fmodm(mfloat_ a, /* a <-- fmod(a,b) */
const mfloat_ b);
This is a floating point modul function. The number a is divided by b, so
that the result is an integer. The rest of this division is returned by the
variable a. The variable b is not changed.
a = k * b + c; k is an integer; 0 <= c / b < 1; c is assigned to a
If the variable b is zero, the error flag is set. For negative b, the result
is negativ or zero. For very large k, which can not be represented as an
mfloat number exactly, the result is zero.
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal frexpm(mfloat_ a, /* a <-- frexp(a,b) */
int * b);
The mfloat number a is splitted in the exponent and in the mantissa:
a = c * (2**b); 0.5 <= |c| < 1 if a <> 0; c is assigned to a
This procedure returns two results. IF the mfloat number is zero, it remains
zero and b is set to zero.
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal hypotm(mfloat_ a, /* a <-- hypot(a,b) */
const mfloat_ b);
This function is used for the calculation of the absolute value of a complex
number:
c = sqrt(sqr(a) + sqr(b)); c is assigned to a; b remains unchanged.
|a + i*b| = hypot(a,b);
This procedure calculates the correct result also if sqr(a) can not be
represented as a mfloat number, because it is too great or too small.
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal ldexpm(mfloat_ a, /* a <-- ldexp(a,b) */
int b);
The mfloat number a is multiplied by the power of two 2**b.
a <-- a * (2**b);
The calculation is very fast because only the exponent of two is manipulated.
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal logm(mfloat_ a); /* a <-- log(a) */
This procedure calculates the natural logarithm ln(x) of the argument a. The
result is the variable a.
a > 0 range of the argument
-22712 <= a <= 22712 range of the result
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal log10m(mfloat_ a); /* a <-- log10(a) */
This procedure calculates the decade logarithm log(x) of the argument a. The
result is the variable a.
a > 0 range of the argument
-9864 <= a <= 9863 range of the result
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal modfm(mfloat_ a, /* a,b <-- modf(a) */
mfloat_ b);
This procedure splits the number a in an integer part and in an fractional
part:
a = k + b; k is an integer; 0 <= b < 1; k is assigned to a
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal powm(mfloat_ a, /* a <-- pow(a,b) */
const mfloat_ b);
This procedure calculates the power of the basis a by the exponent b.
The result is assigned to the variable a, the variable b is not changed.
If the exponent is an integer, the calculation is made by repeated
multiplications. The number of multiplications is proportional to
log(|b|). If b has a fractional, the calculation is made by logarithm. In
this case, the base a must be positiv, otherwise the error flag is set,
because the result is not real. If the number b is very large so that
no fractional can be represented by the mfloat format and the base
is negativ, the error flag is set too. If a = 0 and b <= 0 the result is
not defined and the error flag is set.
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal pow10m(mfloat_ a, /* a <-- 10**b */
int b);
The power of the basis ten is build with the integer exponent b. The result
is assigned to a.
a <-- 10**b
if b < -9864 then a = 0
if b > 9863 then there is an overflow -> error flag is set
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal sinm(mfloat_ a); /* a <-- sin(a) */
This procedure calculates the sin(x) function of the argument a. The result
is the variable a.
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal sinhm(mfloat_ a); /* a <-- sinh(a) */
This procedure calculates the sinh(x) function of the argument a. The result
is the variable a.
-22712 < a < 22712 range of the argument without an overflow
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal sqrtm(mfloat_ a); /* a <-- sqrt(a) */
This procedure calculates the square root of the number a. The result is
variable a. If a is negativ, the error flag is set.
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal tanm(mfloat_ a); /* a <-- tan(a) */
This procedure calculates the tan(x) function of the argument a. The result
is the variable a. It is faster than the calculation of sin(a) / cos(a).
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal tanhm(mfloat_ a); /* a <-- tanh(a) */
This procedure calculates the tanh(x) function of the argument a. The result
is the variable a.
The return value is a pointer to a.
****************************************************************************
++++++++++++++++++ extended standart functions ++++++++++++++++++++++++
****************************************************************************
mfloat_ far pascal acoshm(mfloat_ a); /* a <-- acosh(a) */
This procedure calculates the invers cosh(x) function of the argument a. The
result is the variable a.
1 <= a range of the argument
0 <= a < 22712 range of the result
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal acotm(mfloat_ a); /* a <-- acot(a) */
This procedure calculates the invers cot(x) function of the argument a. The
result is the variable a.
0 < a < pi range of the result
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal acothm(mfloat_ a); /* a <-- acoth(a) */
This procedure calculates the invers coth(x) function of the argument a. The
result is the variable a.
-1 > a or 1 < a range of the argument
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal asinhm(mfloat_ a); /* a <-- asinh(a) */
This procedure calculates the invers sinh(x) function of the argument a. The
result is the variable a.
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal atanhm(mfloat_ a); /* a <-- atanh(a) */
This procedure calculates the invers tanh(x) function of the argument a. The
result is the variable a.
-1 < a < 1 range of the argument
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal cossinm(mfloat_ a, /* a <-- cos(a), */
mfloat_ b); /* b <-- sin(a) */
This procedure calculates the sin(x) and cos(x) function by only one
evulation of the series for the sine function. This procedure is therefore
faster than two individual calculations. It is recommended for example
for the calculation of the exponential function of complex arguments.
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal cotm(mfloat_ a); /* a <-- cot(a) */
This procedure calculates the cot(x) function of the argument a. The result
is the variable a. It is faster than the calculation of cos(a) / sin(a).
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal cothm(mfloat_ a); /* a <-- coth(a) */
This procedure calculates the coth(x) function of the argument a. The result
is the variable a.
range of the result: a >=1 or a <= -1
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal exp10m(mfloat_ a); /* a <-- 10 ** a */
This procedure calculates the power of 10 of the argument a. The
result is the variable a.
a < 9863 range of the argument without an overflow
a < -9864 there is an underflow, result = 0
a >= 0 range of the result
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal sqrm(mfloat_ a); /* a <-- sqr(a) */
This procedure calculates the sqare of a. The result is variabel a.
sqr(a) = a * a
The return value is a pointer to a.
****************************************************************************
mfloat_ far pascal truncm(mfloat_ a); /* a <-- trunc(a) */
This procedure returns the integer part of the number a (truncation). The
result is variable a. The sign has no influence and remains unchanged.
The return value is a pointer to a.
===========================================================================
***************************************************************************
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
***************************************************************************
===========================================================================
Using the subroutines for Borland C++ 3.1 :
===========================================
Using the mfloat library with C++ all of the mathematical operators are
overloaded to allow 'normal' calculations with the mfloat class.
To make the functions available you have to include MFLOAT.CXX (includes
MFLOAT.H) in your source and to specify MFLOATA.OBJ and MFLOATB.OBJ
in your project file.
All operators available for the type 'double' are also defined for 'mfloat',
So it is very easy to convert a program using double variables to use the
mfloat library. Please note that it is not allowed to assign 'mfloat'
variables to other numeric types. To solve this problem please use 'mftod'
and 'mftold' (see the example programs!).
overloaded operators:
=
++ (prefix, suffix) -- (prefix, suffix)
+ (unary, binary) - (unary, binary)
* /
+= -= *= /=
< > <= >= == !=
<< (output to stream) >> (input from stream)
The following functions are overloaded mfloat equivalents of the
functions defined in 'math.h':
acos, asin, atan, atan2, ceil, cos, cosh, exp, fabs, floor, fmod, frexp,
hypot, ldexp, log, log10, modf, pow, sin, sinh, sqrt, tan, tanh.
Instead of 'atof' use 'atofm' (overloading problem).
Instead of 'pow10' use 'pow10m' (overloading problem).
The following functions are extensions to math.h:
acosh, asinh, atanh, acoth, cot, coth, exp10, sqr, trunc.
Further functions are available:
void pascal setmantissawords(int number);
int pascal getmantissawords(void);
void pascal reseterror(void);
char pascal geterror(void);
int pascal strtomf(mfloat &a, const char *s);
char * pascal mftoa(char *str, const mfloat &a, int len);
char * pascal mftostr(char *str, const mfloat &a, int len,
const char * format);
double pascal mftod(const mfloat &b);
long double pascal mftold(const mfloat &b);
All functions are mapped to the functions defined in MFLOAT.H. So please
take a look in description for using the functions with Borland C above.
===========================================================================
***************************************************************************
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
***************************************************************************
===========================================================================
Using the subroutines for TURBO PASCAL 6.0, 7.0 or BORLAND PASCAL 7.0 :
=======================================================================
CONST MfloatWords = 16;
TYPE mfloat = ARRAY[0..MfloatWords-1] OF integer;
(declared in "PFLOAT.PAS")
This type declaration for multiprecision floating point numbers sets
a maximal precision of 15 mantissawords or about 72 decimal digits. The
maximum precision is an assembly time constant. In this release it is fixed
to 15 mantissawords. It is possible to change the type declaration to reduce
the number of mantissawords. This leads to a saving in memory needs and to
higher execution speed at the cost of precision. It is also possible to adjust
the precision dynamically using the function 'setmantissawords'.
The programmer must take care, that the stack is large enough, because
there is no check of a stack overflow.
****************************************************************************
++++++++++++++++++++ mfloat basic functions ++++++++++++++++++++++++++
****************************************************************************
void far pascal setmantissawords(int number);
****************************************************************************
++++++++++++++++++++ mfloat basic functions ++++++++++++++++++++++++++
****************************************************************************
PROCEDURE Setmantissawords(number : integer);
This procedure sets the length of the mantissa of mfloat numbers. The number
of word must be in the range of [1..15], otherwise is is bounded to these
values. If the number of mantissawords is increased and the numbers of
former calculations are used further, the additional words are not
initialized to zero, their values are not known. The error of the number
due to this further mantissa words is of course very small and is in the
range of the accuracy of the former calculation. If it is important, that
this further mantissa words are zero, the user has to initialize them (see
data format). If the number of mantissa words is reduced, every number can
be seen as truncated, although nothing is changed at this numbers. The number
of mantissa words refers only to calculations.
****************************************************************************
FUNCTION GetMantissawords : integer;
The function returns the actual number of mantissa words.
****************************************************************************
PROCEDURE ResetError;
The error flag is reset.
****************************************************************************
FUNCTION GetError : boolean;
The error flag is returned. If there ocures a runtime error during the
calculation, the error flag is set to true.
****************************************************************************
PROCEDURE equm(VAR a, b : mfloat); { *** a <-- b *** }
The value of b is copied to variable a with the defined accuracy. The value
of b is not changed.
****************************************************************************
PROCEDURE addm(VAR a, b : mfloat); { *** a <-- a + b *** }
The value of b is added to the value a with the defined accuracy. The
result is variable a. The value of b is not changed.
****************************************************************************
PROCEDURE subm(VAR a, b : mfloat); { *** a <-- a - b *** }
The value of b is subtracted from the value a with the defined accuracy.
The result is variable a. The value of b is not changed.
****************************************************************************
PROCEDURE multm(VAR a, b : mfloat); { *** a <-- a * b *** }
The value of a is multiplied by value b with the defined accuracy.
The result is variable a. The value of b is not changed.
****************************************************************************
PROCEDURE divm(VAR a, b : mfloat); { *** a <-- a / b *** }
The value of a is divided by the value b with the defined accuracy.
The result is variable a. The value of b is not changed.
****************************************************************************
PROCEDURE multi(VAR a : mfloat; b : integer); { *** a <-- a * b *** }
The value of a is multiplied by the integer value b with the defined
accuracy. The result is variable a.
****************************************************************************
PROCEDURE divi(VAR a : mfloat; b : integer); { *** a <-- a / b *** }
The value of a is divided by the integer value b with the defined accuracy.
The result is variable a.
****************************************************************************
PROCEDURE inversm(VAR a : mfloat); { *** a <-- 1 / a *** }
The number one is divided by the number a with the defined accuracy.
The result is variable a.
****************************************************************************
PROCEDURE negm(VAR a : mfloat); { *** a <-- - a *** }
This procedure changes the sign of the number of variable a.
****************************************************************************
FUNCTION eqZero(VAR a : mfloat) : boolean; { *** eqZero <-- a = 0 *** }
This function checks, if the multiprecision number a zero. The value of a
is not changed.
****************************************************************************
FUNCTION gtZero(VAR a : mfloat) : boolean; { *** gtZero <-- a > 0 *** }
This function checks, if the multiprecision number a is greater than zero.
The value of a is not changed.
****************************************************************************
FUNCTION geZero(VAR a : mfloat) : boolean; { *** geZero <-- a >= 0 *** }
This function checks, if the multiprecision number a is greater than zero or
equal zero. The value of a is not changed.
****************************************************************************
FUNCTION gtm(VAR a, b : mfloat) : boolean; { *** gtm <-- a > b *** }
This function checks, if the multiprecision number a is greater than the
multiprecision number b. The values of a and b are not changed.
****************************************************************************
FUNCTION eqm(VAR a, b : mfloat) : boolean; { *** eqm <-- a = b *** }
This function checks, if the multiprecision number a is equal to the
multiprecision number b. The values of a and b are not changed.
****************************************************************************
PROCEDURE GetZerom(VAR a : mfloat); { *** a <- 0 *** }
This procedure copies the value zero = 0.0000.. to the variable a.
****************************************************************************
PROCEDURE GetOnem(VAR a : mfloat); { *** a <- 1 *** }
This procedure copies the value one = 1.0000.. to the variable a.
****************************************************************************
PROCEDURE GetPim(VAR a : mfloat); { *** a <- pi *** }
This procedure copies the value pi = 3.14159.. to the variable a.
****************************************************************************
PROCEDURE GetLn2m(VAR a : mfloat); { *** a <- ln(2) *** }
This procedure copies the value ln(2) = 0.69314.. to the variable a.
****************************************************************************
PROCEDURE GetLn10m(VAR a : mfloat); { *** a <- ln(10) *** }
This procedure copies the value ln(10) = 2.3025.. to the variable a.
****************************************************************************
FUNCTION strtomf(VAR a : mfloat; { *** a <-- string *** }
b : string) : integer;
This procedure converts the string b to the multiprecision number a. If the
conversion is correct, the integer zero is returned. Otherwise the number
of the character, where the first error occurs, is retured. For conversion
the accuracy of the calculation is set to maximum and afterwards is set back
to the original value. This function is not optimized for speed, but for
accuracy, because it is not used very often normally.
****************************************************************************
FUNCTION mftoa(VAR a : mfloat; { *** string <-- a *** }
len : integer) : string;
This procedure converts the multiprecision number a to a string str with
the number of ASCII - characters, which is defined in len. If the number
of characters is too less, the string is set to stars '*'. The format used
for this conversion is '.32767F' (see FUNCTION mftostr). For conversion the
accuracy of the calculation is set to maximum and afterwards is set back to
the original value. This function is not optimized for speed, but for
accuracy, because it is not used very often normally.
****************************************************************************
FUNCTION mftostr(VAR a : mfloat; { *** string <-- a *** }
len : integer;
format : string) : string;
This procedure converts the multiprecision number a to a string str with
a maximum number of ASCII - characters, which is defined in len. If the
number of characters is too less, the string is set to stars '*'.
The characters in the format - string have following meaning:
'.' the point is shown ervery time, not significant zeros behind the point
are shown
'E' scientific format with an exponent of ten
'F' fixpoint format, if the number is too large, a scientific format
with an exponent of ten is used
'G' fixpoint format, if the number is too large, or less than 0.0001,
a scientific format is used.
'1..' an integer, which declares the number of digits behind the point
for fixpoint and scientific format
The order of the characters (but not the digits for the precision) in the
format string is arbitrary. It is also allowed to use lower case characters
for the format, but in this case the letter for the exponent of ten is
also the lower case letter 'e'. For conversion the accuracy of the
calculation is set to maximum and afterwards is set back to the original
value. This function is not optimized for speed, but for accuracy, because
it is not used very often normally.
****************************************************************************
FUNCTION MfToD(VAR a : mfloat) : double; { *** MfToD <- a *** }
This function converts a multiprecision number to a double precision number.
The value of a is not changed.
****************************************************************************
FUNCTION MfToLd(VAR a : mfloat) : extended; { *** MfToLd <- a *** }
This function converts a multiprecision number to an extended number. The
value of a is not changed.
****************************************************************************
PROCEDURE DToMf(VAR a : mfloat; b : double); { *** a <- b *** }
This procedure converts a double precision number to a multiprecision number.
****************************************************************************
PROCEDURE LdToMf(VAR a : mfloat; b : extended);{ *** a <- b *** }
This procedure converts an extended number to a multiprecision number.
****************************************************************************
++++++++++++++++ standart functions (Borland C: MATH.H) ++++++++++++++
****************************************************************************
PROCEDURE acosm(VAR a : mfloat); { *** a <- arccos(a) *** }
This procedure calculates the invers cos(x) function of the argument a. The
result is the variable a.
-1 <= a <= 1 range of the argument, otherwise the error flag is set
0 <= a <= pi range of the result
****************************************************************************
PROCEDURE asinm(VAR a : mfloat); { *** a <- arcsin(a) *** }
This procedure calculates the invers sin(x) function of the argument a. The
result is the variable a.
-1 <= a <= 1 range of the argument, otherwise the error flag is set
-pi/2 <= a <= pi/2 range of the result
****************************************************************************
PROCEDURE atanm(VAR a : mfloat); { *** a <- arctan(a) *** }
This procedure calculates the invers tan(x) function of the argument a. The
result is the variable a.
-pi/2 < a < pi/2 range of the result
****************************************************************************
PROCEDURE atan2m(VAR a, b : mfloat); { *** a <- arg(b + ia) *** }
This function calulates the extended arctan function phi = arctan(a/b). If
a is zero or less than zero, the correct value phi is retured, which is
in the intervall (-pi, pi]:
a = r * sin(phi)
b = r * cos(phi)
r = sqrt(sqr(a) + sqr(b)) >= 0
If a = 0 and b = 0, the value phi is not unique. The error flag is set and
the number zero is returned. The return variable is the variable a. Variable
b is not changed.
****************************************************************************
PROCEDURE ceilm(VAR a : mfloat); { *** a <-- ceil(a) *** }
The least integer number, which is greater than or equal the mfloat number
a, is calculated and returned to the variable a.
****************************************************************************
PROCEDURE cosm(VAR a : mfloat); { *** a <- cos(a) *** }
This procedure calculates the cos(x) function of the argument a. The result
is the variable a.
****************************************************************************
PROCEDURE coshm(VAR a : mfloat); { *** a <- cosh(a) *** }
This procedure calculates the cosh(x) function of the argument a. The result
is the variable a.
-22712 < a < 22712 range of the argument without an overflow
a >= 1 range of the result
****************************************************************************
PROCEDURE expm(VAR a : mfloat); { *** a <- exp(a) *** }
This procedure calculates exponential function exp(x) of the argument a. The
result is the variable a.
a < 22712 range of the argument without an overflow
a < -22712 there is an underflow result = 0
a >= 0 range of the result
****************************************************************************
PROCEDURE fabsm(VAR a : mfloat); { *** a <-- fabs(a) *** }
This procedure returns the absolute value of the number a. The result is
the variable a.
****************************************************************************
PROCEDURE floorm(VAR a : mfloat); { *** a <-- floor(a) *** }
The greatest integer number, which is less than or equal the mfloat number
a, is calculated and returned to the variable a.
****************************************************************************
PROCEDURE fmodm(VAR a, b : mfloat); { *** a <-- fmod(a,b) *** }
This is a floating point modul function. The number a is divided by b, so
that the result is an integer. The rest of this division is returned by the
variable a. The variable b is not changed.
a = k * b + c; k is an integer; 0 <= c / b < 1; c is assigned to a
If the variable b is zero, the error flag is set. For negative b, the result
is negativ or zero. For very large k, which can not be represented as an
mfloat number exactly, the result is zero.
****************************************************************************
PROCEDURE frexpm(VAR a : mfloat; { *** a <-- frexp(a,b) *** }
VAR b : integer);
The mfloat number a is splitted in the exponent and in the mantissa:
a = c * (2**b); 0.5 <= |c| < 1 if a <> 0; c is assigned to a
This procedure returns two results. IF the mfloat number is zero, it remains
zero and b is set to zero.
****************************************************************************
PROCEDURE hypotm(VAR a, b : mfloat); { *** a <-- hypot(a,b) *** }
This function is used for the calculation of the absolute value of a complex
number:
c = sqrt(sqr(a) + sqr(b)); c is assigned to a; b remains unchanged.
|a + i*b| = hypot(a,b);
This procedure calculates the correct result also if sqr(a) can not be
represented as a mfloat number, because it is too great or too small.
****************************************************************************
PROCEDURE ldexpm(VAR a : mfloat; b : integer); { *** a <-- ldexp(a,b) *** }
The mfloat number a is multiplied by the power of two 2**b.
a <-- a * (2**b);
The calculation is very fast because only the exponent of two is manipulated.
****************************************************************************
PROCEDURE logm(VAR a : mfloat); { *** a <- log(a) *** }
This procedure calculates the natural logarithm ln(x) of the argument a. The
result is the variable a.
a > 0 range of the argument
-22712 <= a <= 22712 range of the result
****************************************************************************
PROCEDURE log10m(VAR a : mfloat); { *** a <- log10(a) *** }
This procedure calculates the decade logarithm log(x) of the argument a. The
result is the variable a.
a > 0 range of the argument
-9864 <= a <= 9863 range of the result
****************************************************************************
PROCEDURE modfm(VAR a, b : mfloat); { *** a, b <-- modf(a) *** }
This procedure splits the number a in an integer part and in an fractional
part:
a = k + b; k is an integer; 0 <= b < 1; k is assigned to a
****************************************************************************
PROCEDURE powm(VAR a, b : mfloat); { *** a <-- pow(a,b) *** }
This procedure calculates the power of the basis a by the exponent b.
The result is assigned to the variable a, the variable b is not changed.
If the exponent is an integer, the calculation is made by repeated
multiplications. The number of multiplications is proportional to
log(|b|). If b has a fractional, the calculation is made by logarithm. In
this case, the base a must be positiv, otherwise the error flag is set,
because the result is not real. If the number b is very large so that
no fractional can be represented by the mfloat format and the base
is negativ, the error flag is set too. If a = 0 and b <= 0 the result is
not defined and the error flag is set.
****************************************************************************
PROCEDURE pow10m(VAR a : mfloat; b : integer); { *** a <-- pow10(a) *** }
The power of the basis ten is build with the integer exponent b. The result
is assigned to a.
a <-- 10**b
if b < -9864 then a = 0
if b > 9863 then there is an overflow -> error flag is set
****************************************************************************
PROCEDURE sinm(VAR a : mfloat); { *** a <- sin(a) *** }
This procedure calculates the sin(x) function of the argument a. The result
is the variable a.
****************************************************************************
PROCEDURE sinhm(VAR a : mfloat); { *** a <- sinh(a) *** }
This procedure calculates the sinh(x) function of the argument a. The result
is the variable a.
-22712 < a < 22712 range of the argument without an overflow
****************************************************************************
PROCEDURE sqrtm(VAR a : mfloat); { *** a <- sqrt(a) *** }
This procedure calculates the square root of the number a. The result is
variable a. If a is negativ, the error flag is set.
****************************************************************************
PROCEDURE tanm(VAR a : mfloat); { *** a <- tan(a) *** }
This procedure calculates the tan(x) function of the argument a. The result
is the variable a. It is faster than the calculation of sin(a) / cos(a).
****************************************************************************
PROCEDURE tanhm(VAR a : mfloat); { *** a <- tanh(a) *** }
This procedure calculates the tanh(x) function of the argument a. The result
is the variable a.
-1 <= a <= 1 range of the result
****************************************************************************
++++++++++++++++++ extended standart functions ++++++++++++++++++++++++
****************************************************************************
PROCEDURE acoshm(VAR a : mfloat); { *** a <- arcosh(a) *** }
This procedure calculates the invers cosh(x) function of the argument a. The
result is the variable a.
1 <= a range of the argument
0 <= a < 22712 range of the result
****************************************************************************
PROCEDURE acotm(VAR a : mfloat); { *** a <- arccot(a) *** }
This procedure calculates the invers cot(x) function of the argument a. The
result is the variable a.
0 < a < pi range of the result
****************************************************************************
PROCEDURE acothm(VAR a : mfloat); { *** a <- arcoth(a) *** }
This procedure calculates the invers coth(x) function of the argument a. The
result is the variable a.
-1 > a or 1 < a range of the argument
****************************************************************************
PROCEDURE asinhm(VAR a : mfloat); { *** a <- arsinh(a) *** }
This procedure calculates the invers sinh(x) function of the argument a. The
result is the variable a.
****************************************************************************
PROCEDURE atanhm(VAR a : mfloat); { *** a <- artanh(a) *** }
This procedure calculates the invers tanh(x) function of the argument a. The
result is the variable a.
-1 < a < 1 range of the argument
****************************************************************************
PROCEDURE cossinm(VAR a, b : mfloat); { *** a <-- cos(a), b <-- sin(a) *** }
This procedure calculates the sin(x) and cos(x) function by only one
evulation of the series for the sine function. This procedure is therefore
faster than two individual calculations. It is recommended for example
for the calculation of the exponential function of complex arguments.
****************************************************************************
PROCEDURE cotm(VAR a : mfloat); { *** a <- cot(a) *** }
This procedure calculates the cot(x) function of the argument a. The result
is the variable a. It is faster than the calculation of cos(a) / sin(a).
****************************************************************************
PROCEDURE cothm(VAR a : mfloat); { *** a <- coth(a) *** }
This procedure calculates the coth(x) function of the argument a. The result
is the variable a.
range of the result: a >=1 or a <= -1
****************************************************************************
PROCEDURE exp10m(VAR a : mfloat); { *** a <- 10 ** a *** }
This procedure calculates the power of 10 of the argument a. The
result is the variable a.
a < 9863 range of the argument without an overflow
a < -9864 there is an underflow, result = 0
a >= 0 range of the result
****************************************************************************
PROCEDURE sqrm(VAR a : mfloat); { *** a <- sqr(a) *** }
This procedure calculates the sqare of a. The result is variabel a.
sqr(a) = a * a
****************************************************************************
PROCEDURE truncm(VAR a : mfloat); { *** a <-- trunc(a) *** }
This procedure returns the integer part of the number a (truncation). The
result is variable a. The sign has no influence and remains unchanged.
===========================================================================
***************************************************************************
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
***************************************************************************
===========================================================================
Examples with MFLOAT:
a) The calculation of the number PI:
There are many methods known to calculate the irrational number pi,
which is the relation between the perimeter and the diameter of a circle.
Pi = 16 * arctan(1 / 5) - 4 * arctan(1 / 239);
Pi = 48 * arctan(1 / 18) + 32 * arctan(1 / 57) - 20 * arctan(1 / 239);
Pi = 6 * arctan(1 / sqrt(3));
PI = 4 * (1 - 1/3 + 1/5 - 1/7 + ....)
The function arctan(x) is calculated with the series:
arctan(x) = x - x**3 / 3 + x**5 / 5 - x**7 / 7 + ......
The operation "x**.." means the power of x.
The advantage of the first two equations is, that they converge very
fast and every element of the series can be calculated recursivly from
the former element by only two divisions by an integer. A division by
an integer is very fast for long MFLOAT-numbers.
The third equation is build by the known value : tan(pi/6) = 1 / sqrt(3).
This method is slower than the former. The forth equation has only
theoretical importance, in the numerical use this equation is
very bad.
The example program compares the different methods to calculate pi.
b) The evulation of the bessel function for large arguments:
There are some series known, where some elements are very large and
the result is small. In this case numerical errors are very large.
An example is the bessel function J0(x) for an argument x = 100.0.
J0(x) = 1 - (x/2)**2 / (1! * 1!) + (x/2)**4 / (2! * 2!) - ......
The series converges very good, but there are problems with the accuracy
of the calculation. The test program shows the influence of the accuracy
of the calculation to the result.
c) Conversion of cartesian coordinates to polar coordinates:
This example shows the use of the functions hypot(x,y) and atan2(x,y).
This program also demonstrates the input of MFLOAT-numbers.
===========================================================================
***************************************************************************
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
***************************************************************************
===========================================================================
Graz, 28.4.1993
Kaufmann Friedrich & Mueller Walter
Students at the Technical University of Graz
Registration address:
Kaufmann Friedrich
Schuetzenhofgasse 22
A-8010 GRAZ
AUSTRIA
EUROPE
email address:
fkauf@fstgds06.tu-graz.ac.at